Genetic Epidemiology
Wiley
Preprints posted in the last 90 days, ranked by how well they match Genetic Epidemiology's content profile, based on 46 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Singh Sachan, A. N.; Schwartzman, A.; Azriel, D.
SNP-heritability is defined as the fraction of variance of a trait that is explained by the SNPs in a genome-wide association study. Several methodologies have been proposed to estimate this quantity. More recent methods aim to do so with ancestrally diverse datasets and yet obtain a single heritability for an entire dataset, which we refer to as marginal heritability. However, the different underlying subpopulations that compose a genetically diverse dataset might have different environmental and genetic exposures, and thus may have different heritabilities. To address this, we propose a conditional SNP-heritability approach that allows estimation of multiple SNP-heritabilities on a dataset, corresponding to different ancestral compositions and environmental exposures. We take a careful statistical approach, including estimation of conditional genetic and environmental variances, and calculation of standard errors via a combination of the delta method with bootstrapping. We validate our method via extensive simulations. We then apply it to an ancestrally and socio-economically diverse dataset of 6603 subjects aged around 9 to 11 from the Adolescent Brain Cognitive Development study, and illustrate how the SNP-heritability of intelligence scores can change due to differing extrinsic variances in different socio-economic groups, which coincides with previous work in the literature. This conditional estimation approach can be a valuable tool for understanding differences in risks across subpopulations. Our work here improves on existing methodology and allows us to leverage the heterogeneity of the data to obtain new insights.
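As a rough illustration of the delta-method step mentioned in this abstract, the sketch below computes a heritability ratio and its approximate standard error from two variance components. It is not the authors' conditional estimator: the function name, the inputs, and the assumption that a 2x2 sampling covariance matrix of the variance estimates is already available (for example from a bootstrap) are all hypothetical.

```python
import math


def heritability_with_delta_se(var_g, var_e, cov_matrix):
    """Return (h2, se) for h2 = var_g / (var_g + var_e).

    cov_matrix is an assumed 2x2 sampling covariance of the estimates
    (var_g, var_e), e.g. obtained from bootstrap replicates.
    """
    total = var_g + var_e
    h2 = var_g / total
    # Gradient of h2 with respect to (var_g, var_e).
    grad = (var_e / total ** 2, -var_g / total ** 2)
    # First-order (delta method) variance: grad^T * Sigma * grad.
    variance = sum(
        grad[i] * cov_matrix[i][j] * grad[j]
        for i in range(2)
        for j in range(2)
    )
    return h2, math.sqrt(variance)


if __name__ == "__main__":
    h2, se = heritability_with_delta_se(1.0, 1.0, [[0.04, 0.0], [0.0, 0.04]])
    print(h2, se)
```

The delta method is only a first-order approximation, so it may be inaccurate when the variance components are estimated with large uncertainty or when the ratio is near 0 or 1.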
Li, Y.; Cabral, H.; Tripodis, Y.; Ma, J.; Levy, D.; Joehanes, R.; Liu, C.; Lee, J.
Mediation analysis quantifies how an exposure affects an outcome through an intermediate variable. We extend mediation analysis to capture the cumulative effects of longitudinal predictors on longitudinal outcomes. Our proposed model examines how mediators transmit the effects of the current and previous exposure on the current outcome. We construct a least-squares estimator for the cumulative indirect effect (CIE) and use three approaches (exact form, delta method, and bootstrap procedure) to estimate its standard error (SE). The CIE estimator is unbiased when there is no unmeasured confounding and the errors of the mediator and outcome models are independent at all time points, as shown analytically and in simulations. While the three SE estimates are numerically similar, the bootstrap procedure is recommended because it is simple to implement. We apply this method to the Framingham Heart Study Offspring cohort to assess whether DNA methylation mediates the association of alcohol consumption with systolic blood pressure over two time points. We identify two CpGs (cg05130679 and cg05465916) as mediators and construct a composite DNA methylation score from 11 CpGs, which mediates 39% of the cumulative effect. In conclusion, we propose an unbiased estimator for CIE. Future studies will investigate missingness in mediators and outcomes.
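For readers unfamiliar with least-squares mediation, the following minimal sketch shows the classic single-time-point product-of-coefficients indirect effect. It is a simplification and not the cumulative (longitudinal) estimator described in the abstract; the function names are hypothetical, and the sketch assumes one exposure, one mediator, and no additional covariates.

```python
def _mean(values):
    return sum(values) / len(values)


def _slope(x, y):
    """Ordinary least-squares slope of y on x (with an intercept)."""
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def indirect_effect(exposure, mediator, outcome):
    """Product-of-coefficients estimate a * b.

    a: slope of mediator on exposure.
    b: coefficient of mediator in outcome ~ exposure + mediator,
       obtained by residualizing the mediator on the exposure
       (Frisch-Waugh-Lovell theorem).
    """
    a = _slope(exposure, mediator)
    mx, mm = _mean(exposure), _mean(mediator)
    mediator_resid = [m - mm - a * (x - mx) for x, m in zip(exposure, mediator)]
    b = _slope(mediator_resid, outcome)
    return a * b


if __name__ == "__main__":
    x = [0, 1, 2, 3, 4, 5]
    noise = [1, -2, 1, 1, -2, 1]  # sums to zero and is orthogonal to x
    m = [2 * xi + ni for xi, ni in zip(x, noise)]
    y = [3 * mi + xi for mi, xi in zip(m, x)]
    print(indirect_effect(x, m, y))
```

A bootstrap SE, as recommended in the abstract, could be obtained by resampling subjects with replacement and recomputing this quantity on each resample.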
Miao, X.; Edge, M. D.; Harpak, A.
Standard genome-wide association studies (GWASs) are vulnerable to confounding factors, including stratification, assortative mating, and dynastic effects. Family studies such as sibling-based GWAS (sib-GWAS) mitigate such confounding and are becoming the tool of choice for teasing apart direct genetic effects (causal effects of one's genotype on one's own phenotype) from other factors. However, due in part to their smaller sample sizes, sib-GWAS allelic effect estimates are substantially more variable than standard (i.e., population-based) GWAS estimates. The quantification of this uncertainty is essential for many uses of sib-GWAS, including polygenic scoring, causal inference (e.g., Mendelian randomization), disentangling direct from indirect familial effects, and measuring assortative mating. Here, we investigate sources of uncertainty in sib-GWAS allelic effect estimators. We study their impacts on the biases of three uncertainty measurement methods, including two that are commonly used and a new resampling-based approach we propose. We find that heterogeneity in allelic effects or heteroskedasticity across families (e.g., due to variation in genetic backgrounds or environments) can bias existing methods, and that this bias is more severe for small samples and rare variants. In contrast, the resampling-based approach we propose is approximately unbiased under all scenarios we considered. We validate our theoretical predictions, as well as the importance of effect heterogeneity and heteroskedasticity, using simulations and empirical analysis in the UK Biobank. In sum, this study helps understand the sources of uncertainty in family-based genotype-phenotype association studies and provides a robust method to estimate uncertainty.
Wang, J.; Morrison, J.
Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal relationships between complex traits. Standard MR can be used to estimate an average causal effect at the population level, and typically assumes a linear exposure-outcome relationship. Recently, several methods for estimating nonlinear effects have been developed. However, many have been found to produce spurious empirical findings when subjected to negative control analyses. We propose that this poor performance may be attributable to heterogeneity in variant-exposure associations. We demonstrate that heterogeneous genetic effects on exposure lead to biased estimates, poor coverage, and inflated type I error in control function and stratification-based methods. In contrast, two-stage least squares (TSLS) methods are robust to such heterogeneity, but suffer from low precision and low power in some circumstances. We show that a statistical test for heterogeneity can be used to guide the choice of nonlinear MR methods. Using UK Biobank data, we reassess the causal effects of BMI, vitamin D, and alcohol consumption on blood pressure, lipid, C-reactive protein, and age (negative control). We find strong evidence of heterogeneity for all three exposures, and also recapitulate previous results that control function and stratification-based methods are prone to false positives. Finally, using nonparametric TSLS, we identify evidence of nonlinear causal effects of BMI on HDL cholesterol, triglycerides, and C-reactive protein; however, specific estimates of the shape of these relationships are imprecise. Altogether, our results suggest that common nonlinear MR methods are unreliable in the presence of realistic levels of heterogeneity, and that more methodological development is required before practically useful nonlinear MR is feasible.
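To make the linear baseline concrete, here is a minimal sketch of the single-instrument ratio (Wald) estimator, which coincides with linear TSLS when there is one instrument and no covariates. It does not implement the nonlinear or nonparametric methods evaluated in the abstract; the function names and the toy data are hypothetical.

```python
def _cov(u, v):
    """Sample covariance (unnormalized scaling cancels in the ratios below)."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def wald_ratio(genotype, exposure, outcome):
    """IV estimate of the exposure->outcome effect: cov(G, Y) / cov(G, X)."""
    return _cov(genotype, outcome) / _cov(genotype, exposure)


def ols_slope(exposure, outcome):
    """Naive regression slope, which is biased under confounding."""
    return _cov(exposure, outcome) / _cov(exposure, exposure)


if __name__ == "__main__":
    g = [0, 1, 2, 0, 1, 2]            # instrument (allele counts)
    u = [1, 1, 1, -1, -1, -1]         # confounder, uncorrelated with g
    x = [0.5 * gi + ui for gi, ui in zip(g, u)]
    y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # true effect is 2
    print(wald_ratio(g, x, y), ols_slope(x, y))
```

In this toy example the instrument is exactly uncorrelated with the confounder, so the ratio estimator recovers the true effect while the naive slope is inflated.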
Pham, B. K.; Davenport, S.; Azriel, D.; Schwartzman, A.
LD Score Regression (LDSC) is a prominent method, which estimates whole-genome SNP heritability from summary statistics via the slope of a linear regression of GWAS test statistics corresponding to a trait of interest against LD scores. It was claimed by the LDSC authors that the free intercept in the regression accounts for confounding bias such as population stratification. In this study, we argue that the intercept in LDSC must be fixed to 1 for accurate SNP heritability estimation. We show both theoretically and with simulations that the estimated intercept does not accurately capture population stratification effects, and that it adversely affects the accuracy of the heritability estimate introducing bias and increasing variance. Fixing the intercept to 1 eliminates bias and reduces variance when no population stratification is present. On the other hand, under population stratification, LDSC is biased with both the free and the fixed intercept. Additionally, we show that estimated standard errors in LDSC are underestimated, potentially leading to false-positives in downstream GWAS analyses.
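The free-versus-fixed intercept contrast discussed in this abstract can be sketched with the standard LDSC expectation, E[chi2_j] = intercept + (N * h2 / M) * l_j, where l_j is the LD score of SNP j, N the sample size, and M the number of SNPs. The code below is an unweighted toy version under that assumption; the actual LDSC software uses regression weights and a block jackknife, which are omitted here, and the function name is hypothetical.

```python
def ldsc_h2(chi2, ld_scores, n_samples, n_snps, fixed_intercept=None):
    """Unweighted LD score regression.

    Returns (h2_estimate, intercept). If fixed_intercept is None the
    intercept is estimated freely; otherwise it is held at that value.
    """
    if fixed_intercept is None:
        mean_l = sum(ld_scores) / len(ld_scores)
        mean_c = sum(chi2) / len(chi2)
        sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
        sxx = sum((l - mean_l) ** 2 for l in ld_scores)
        slope = sxy / sxx
        intercept = mean_c - slope * mean_l
    else:
        # Regression through a known intercept: fit (chi2 - intercept) on l.
        intercept = fixed_intercept
        sxy = sum(l * (c - intercept) for l, c in zip(ld_scores, chi2))
        sxx = sum(l * l for l in ld_scores)
        slope = sxy / sxx
    # slope = N * h2 / M, so h2 = slope * M / N.
    return slope * n_snps / n_samples, intercept


if __name__ == "__main__":
    ld = [1.0, 2.0, 3.0, 4.0, 5.0]
    n, m, true_h2 = 1000, 5, 0.5
    chi2 = [1.0 + n * true_h2 * l / m for l in ld]  # noiseless toy statistics
    print(ldsc_h2(chi2, ld, n, m))
    print(ldsc_h2(chi2, ld, n, m, fixed_intercept=1.0))
```

With noiseless data both variants agree; the differences in bias and variance that the abstract reports arise only once sampling noise or stratification is present.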
Webster, A. J.; Drakesmith, C. W.; Perera-Salazar, R.; Steinsaltz, D.; COMPUTE team
Biomarker measurements can assist with disease diagnosis and the assessment of disease risks, with the most recent measurements usually used by disease-risk models. However, a growing number of studies suggest that in addition to a biomarker's value, its inherent variability, estimated from several measurements over many days or years in an individual, can convey independent prognostic information about disease risks. Variance estimates require an individual's biomarker data to have been measured a sufficient number of times, ideally across a long time period, and are usually only available in a hospital setting or clinical trial. Furthermore, a single biomarker measurement will involve a combination of measurement error, natural short-term variation over a daily time period, variation over time periods of weeks and months, and slower age-dependent changes over several years. This paper develops a statistical method that accounts for these latter concerns, and applies it to Clinical Practice Research Datalink (CPRD) data collected by UK General Practitioners. It studies the associations between cardiovascular health outcomes and the within-person variances of eight routinely measured biomarkers. This involved Sequential Monte Carlo modeling to convert an individual's biomarker measurements (collected over months or years) into estimates of the biomarker's mean, linear age-dependent slope, within-person variance, and a variance due to variation on a daily time period or measurement errors. The result is a proof-of-principle that UK primary care Electronic Health Records (from CPRD) can be effectively used for this purpose. After adjusting for mean biomarker values, clear associations were found between mortality or cardiovascular disease risks and within-person variances for 6 of 8 biomarkers.
Bazemore, K.; Iqbal, T.; Kuzma, A. B.; Grant, S. F. A.; Schellenberg, G. D.; Wang, L.-S.; Chesi, A.; Jin, J.; Naj, A. C.
Pathway-specific polygenic risk scores (pathway-PRS) measure aggregate genetic risk across single nucleotide variants (SNVs) annotated to genes in a pathway of interest. In most applications, SNV-to-gene annotation is based on SNV position with respect to gene boundaries. This approach is ill-suited for incorporating non-coding SNVs, which can regulate gene expression over long distances and represent a large proportion of risk variants for Alzheimer's disease (AD). Here, we compare the performance of AD pathway-PRS across SNV-to-gene annotation strategies that integrate varying levels of functional genomic data, including adult brain chromatin interaction and expression quantitative trait loci (eQTL) data. In the UK Biobank (n=328,526), including AD cases defined by ICD-9/10 codes (n=3,043) and by family history of AD/dementia (n=38,589), we show that the annotation strategy integrating chromatin interaction and eQTL data consistently improves pathway-PRS performance. We replicate this finding in independent data from the Alzheimer's Disease Genetics Consortium (n=3,370). We further find that pathway-PRS associations with AD vary by annotation strategy and that power to detect sex-dependent and age-at-onset associations is increased with integrative annotation. Together, these findings support the use of functionally informed SNV-to-gene annotation for pathway-PRS construction and highlight the importance of applying multiple annotation strategies for robust inference.
Zhang, L.; Paterson, A. D.; Sun, L.
Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson χ²-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.
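As background for the Pearson-type tests this abstract unifies, the sketch below implements the textbook 1-degree-of-freedom chi-square HWE test for a biallelic autosomal locus. It is not the allele-based regression framework the authors propose and does not handle the X chromosome or sdMAF; the function name is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of HWE from autosomal genotype counts.

    Returns (statistic, p_value), using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # counts exactly at HWE proportions
    print(hwe_chi_square(30, 40, 30))  # heterozygote deficit
```

The chi-square approximation can be poor for rare alleles with small expected counts, where an exact test is usually preferred.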
Nouira, A.; Favre Moiron, M.; Tournaire, M.; Verbanck, M.
Genome-wide association studies (GWAS) have identified numerous genetic variants associated with complex traits. However, linkage disequilibrium (LD) confounds these associations, leading to false positives where non-causal variants appear associated because they are correlated with nearby causal variants. This is particularly the case in highly polygenic traits where the genome can be saturated in causal variants. To address this issue, we propose LDeconv, a method based on truncated singular value decomposition (SVD) that adjusts GWAS summary statistics without requiring individual-level genotype data. This approach accounts for LD structure, isolates causal variants in high-LD regions, and improves the reliability of effect size estimates. We assess its performance through simulations across various LD scenarios, conduct extensive sensitivity analyses, and apply it to real GWAS data from the UK Biobank. Our results demonstrate that LDeconv effectively reduces false discoveries while preserving true associations, offering a robust framework for post-GWAS analysis.
Ahlqvist, V. H.; Sjoqvist, H.; Gardner, R. M.; Lee, B. K.
Background: Sibling-matched designs control for shared familial confounding but remain vulnerable to non-shared confounders. Bi-directional sensitivity analyses, which stratify families by whether the older or younger sibling was exposed, are commonly used to assess carryover effects. We aimed to demonstrate how this methodological approach can introduce severe confounding by parity. Methods: We conducted simulations motivated by a recent epidemiological study. The true causal effect of a hypothetical exposure (prenatal acetaminophen) on neurodevelopmental outcomes was set to strictly null. To introduce parity-related confounding, baseline exposure and outcome probabilities were varied slightly by birth order. We compared conditional logistic regression effect estimates from total sibling models against bi-directional stratified models. Results: In the total simulated sibling cohort, models yielded the true null effect (odds ratio = 1.00) when adjusting for parity. However, the bi-directional analyses exhibited divergent artifactual signals. Because parity is perfectly collinear with exposure in these stratified subsets, it cannot be adjusted for. For example, when the older sibling was exposed, the odds ratio for autism spectrum disorder was 1.68; when the younger was exposed, the odds ratio was 0.60. Conclusions: Divergent estimates in bi-directional sibling analyses can be a predictable artifact of parity confounding rather than evidence of carryover effects or invalidating unmeasured bias. Overall sibling models adjusting for parity may remain robust despite divergent stratified sensitivity results.
Nastou, K.; Ottosson, F. A.; Schmidt, A.; Corn, G.; Geller, F.; Grundvad Boelt, S.; MacSween, N.; Wohlfahrt, J.; Lund, M.; Melbye, M.; Ernst, M.; Feenstra, B.
Congenital heart defects (CHDs) are the most common congenital malformations and often arise from perturbations during early embryonic development. Maternal metabolic disturbances in early pregnancy may contribute to CHD risk, but evidence from early first-trimester metabolomics studies is limited. We conducted an untargeted metabolomics case-control study using early first-trimester maternal plasma samples (gestational weeks 4-10) from the Danish National Birth Cohort. Metabolite profiling was performed via liquid chromatography-tandem mass spectrometry (LC-MS/MS) on 160 matched CHD case-control pairs (320 total samples). Conditional logistic regression and interaction analysis were used to identify metabolites associated with CHD risk or specific cardiac phenotypes. A total of 1,471 metabolite features were measured with 69 metabolites being associated with CHD at nominal significance (p < 0.05). These included a desaturated analog of sphingosine-1-phosphate (S1P), isoleucylproline and an arginine related metabolite. However, after false discovery rate correction for multiple testing no metabolites remained significant. While these findings do not preclude that subtle metabolic variation may exist in early pregnancy among CHD cases, they also underscore the challenges of biomarker discovery in this context. This work highlights the potential of early-pregnancy metabolomics for CHD biomarker discovery, and points toward more targeted future studies with improved sample collection protocols, pre-specified pathway panels, and phenotype-homogeneous analyses to better capture the subtle metabolic variation that may underlie CHD risk.
Gantenberg, J. R.; La Joie, R.; Heston, M. B.; Ackley, S. F.
Qualitative models of Alzheimer's pathology often posit that amyloid accumulation follows a sigmoid curve, indicating that the rate of deposition wanes over time. Longitudinal PET data now allow us to investigate amyloid accumulation trajectories with greater detail and over longer follow-up periods. We combine inferences from simulated amyloid trajectories, empirical PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the sampled iterative local approximation algorithm (SILA) to assess whether amyloid accumulation reaches a physiologic ceiling. We find that SILA reliably detects a ceiling, when present, across a range of simulated scenarios that impose a sigmoid shape. When fit to empirical data from ADNI, however, SILA does not appear to indicate the presence of a ceiling. Thus, we conclude that amyloid trajectories may not reach a physiologic ceiling during the stages of Alzheimer's disease typically observed while patients remain under follow-up in cohort studies. Fits using SILA indicate that illustrative models of biomarker cascades, while useful tools for conceptualizing and interrogating pathologic processes, may not represent the shapes of amyloid trajectories accurately. Summary for General Public: Amyloid, a protein implicated in Alzheimer's disease, is thought to reach a plateau in the brain, but methods that estimate how amyloid changes over time suggest it grows unabated. Gantenberg et al. use one such method and simulations to argue that amyloid does not reach a plateau during the typical course of Alzheimer's.
Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.
Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods -- CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA -- using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
Satterstrom, F. K.; Jodeiry, K.; Mahjani, B.; Hatem, G.; Park, S. J.; Klei, L.; Fu, J. M.; Wigdor, E. M.; the Autism Sequencing Consortium; Betancur, C.; Daly, M. J.; Roeder, K.; Devlin, B.; Buxbaum, J. D.; Cutler, D. J.
Autism spectrum disorder (ASD) is estimated to be up to four times as common in males as in females, yet the causes of this prevalence difference are not well established. One possible driver is genetic variation on the X chromosome, as it contains genes capable of contributing to ASD (e.g., PTCHD1, MECP2) and is known to play a role in genetic disorders with differential sex prevalence (e.g., color blindness). However, a lack of power compared to the autosomes, combined with the complexities of modeling its biology, has led to the X being largely overlooked in sequencing studies. Here, we develop quantitative X-linked TADA, a new model designed specifically for application to this chromosome, and use it to analyze rare variation from 50,663 individuals with ASD (and 136,670 individuals total). We find 9 genes on the X associated with ASD at a false discovery rate (FDR) < 0.05 and an additional 9 genes at FDR < 0.2, with many of these previously identified as involved in specific neurodevelopmental disorders. Point estimates of the liability conferred by de novo variants on the X are similar in females and males, with both sexes' estimates elevated >20% above the corresponding autosomal values. We also develop a general theory of how X-linked variation of any additive or non-additive effect influences liability and describe its implications for prevalence. Using this theory and our empirical results, we show how genetic variation on the X could contribute to the sex-differential prevalence of ASD.
Beer, S.; Simpkin, A. J.; Eldeeb, S. Y.; Zar, H. J.; Stein, D. J.; Dunn, E. C.; Smith, A. D. A. C.
Background: In prospective cohort studies, where an exposure is collected repeatedly, interest often lies in determining whether the timing of that exposure has a differential effect on a later outcome. The Structured Life Course Modeling Approach (SLCMA), where users select between temporal hypotheses of exposure specified a priori, provides one way to analyse such longitudinal data. However, few studies using SLCMA consider the effect of time-varying covariates (TVC) which may impact associations. Methods: We present a modified version of the SLCMA - called direct and mediated effects (DME)-SLCMA - which corrects for TVC. We first develop the DME-SLCMA method, test it through simulation, and apply it to psychosocial data from the Drakenstein Child Health Study (DCHS, n=336) to investigate relationships between maternal psychopathology, TVC of socioeconomic status, and offspring depressive symptoms. Results: We found that, on average, offspring depressive symptoms score increased by 3.9% (95% CI: 1.0%-6.9%, p = 0.039) for each unit of maternal psychopathology (SRQ) at 48 months whilst adjusting for time-varying socioeconomic status (at 18, 30, 42 and 54 months). Our simulations identified several realistic scenarios where selections ignoring TVC - with TVC mediated exposure effects present - were prone to be incorrect, including our DCHS example. Conclusion: DME-SLCMA is a robust new approach for life course modelling in the presence of time-varying covariates. We recommend adjusting for TVC whenever possible, and, when not possible, our simulation study identified that scenarios where mediated effects are comparable, or greater, in magnitude to direct effects are most prone to confounding.
Larsen, T. E.; Lorca, M. H.; Ekstrom, C. T.; Vinding, R.; Bonnelykke, K.; Strandberg-Larsen, K.; Petersen, A. H.
Childhood weight development, especially overweight and obesity, has been associated with mental health, but their dynamic, causal relationships, and whether these differ by sex, remain unclear. We applied causal discovery to data from the Danish National Birth Cohort (n=67,593) spanning six periods from pregnancy to late adolescence and considering 67 variables related to child and parental weight, mental health, lifestyle, and socio-economic factors. We found no statistically significant difference between the causal graphs for boys and girls (P=0.079). The data-driven models found causal influence of childhood weight on subsequent weight status. Mental health pathways were exclusively within or across adjacent periods and centered on early adolescent stress. We examined the interplay between a subset of mental health variables, containing information on externalizing and internalizing problems, and weight, and found no direct causal pathway between the two processes. These findings suggest that observed links between weight and these mental health measures may be attributable to confounding. Our findings demonstrate the value of data-driven causal discovery in large cohort studies and how to test for differences in causal mechanisms across subgroups. Results are available in an interactive application, enabling future research to further explore the interplay between weight and mental health.
Zheng, W.; Liu, T.; Xu, L.; Xie, Y.; Jing, Y.; Shao, H.; Zhao, H.
Phenome-wide association studies (PheWAS) enable systematic exploration of relationships between genetic variants and clinical phenotypes derived from electronic health records (EHRs). Conventional regression-based PheWAS treats phenotypes separately and relies on binary phenotype representations, which limits statistical power for rare variants and rare phenotypes and reduces the ability to detect associations with phenotypes that are distributed across clinical codes. To address this limitation, we first developed EmbedPheScan, a phenotype embedding-based prioritization framework that summarizes the phenotypic profiles of rare loss-of-function variant carriers in a continuous embedding space. We then proposed EA-PheWAS by combining these embedding-derived signals with conventional regression-based PheWAS results using the aggregated Cauchy association test. Using the UK Biobank whole-exome sequencing and EHR data, we show that the proposed methods maintain appropriate false-positive control. We then performed genome-wide phenome scans across all genes and across biologically defined gene classes to evaluate EA-PheWAS relative to conventional PheWAS and EmbedPheScan, consistently finding that EA-PheWAS outperformed the other two methods. We illustrate the utility of EA-PheWAS focusing on four genes representing distinct scenarios, including strong-effect disease genes (PKD1, PKD2), genes with large numbers of rare LoF carriers (NF1), and genes with extremely sparse carrier counts (FBN1).
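The aggregated Cauchy association test mentioned in this abstract combines p-values through a weighted sum of Cauchy-transformed statistics. The sketch below shows that combination rule in its standard form; how EA-PheWAS chooses its inputs and weights is not stated in the abstract, so the equal default weights and the function name are assumptions.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination (ACAT-style) rule.

    Each p-value must lie strictly between 0 and 1. Weights default to
    equal values and are normalized to sum to 1.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    total_weight = sum(weights)
    # Transform each p-value to a standard Cauchy variate and average.
    statistic = sum(
        (w / total_weight) * math.tan((0.5 - p) * math.pi)
        for w, p in zip(weights, p_values)
    )
    # Convert back using the standard Cauchy survival function.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    print(cauchy_combination([0.05, 0.05]))
    print(cauchy_combination([0.01, 0.5]))
```

A useful property of this rule is that the combined p-value remains approximately valid in the tail even when the input tests are correlated, which is one reason it is popular for merging results from different methods applied to the same data.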
Zhu, Z.; Shan, S.
Background: Several lipid ratios have been linked to obstructive sleep apnea (OSA) risk in NHANES, yet two questions central to clinical translation remain unanswered: how much of the association is carried by central adiposity, and whether the dose-response curve contains an actionable threshold. We addressed both for the remnant cholesterol-to-HDL-C ratio (RC/HDL-C). Methods: We analysed 3,635 adults aged ≥20 years from NHANES 2015-2018. OSA risk was ascertained from the Sleep Disorders Questionnaire. Multivariable logistic regression estimated odds ratios across three nested models. Restricted cubic splines and segmented regression characterised the dose-response and located the inflection point. Mediation by body roundness index (BRI) was quantified by nonparametric percentile bootstrap (1,000 resamples). Discrimination was compared by ROC analysis, with stratified and trimmed-sample sensitivity analyses. Results: OSA risk was identified in 1,361 participants (37.4%). Each one-unit rise in RC/HDL-C carried 23% higher adjusted odds of OSA (OR 1.23, 95% CI 1.03-1.47); the highest quartile carried 49% higher odds than the lowest (P-trend < 0.001). The dose-response was nonlinear, with an inflection at RC/HDL-C = 0.232: below this point each 0.1-unit increase raised odds by 54% (OR 1.54, 95% CI 1.16-2.05); above it the curve plateaued. BRI mediated 82.7% of the total effect (ACME 0.039, P < 0.001), with the indirect pathway 2.8 times stronger in women. AUCs were 0.599 (BRI) and 0.564 (RC/HDL-C). Conclusions: RC/HDL-C showed a modest, threshold-shaped association with OSA risk in U.S. adults, with central adiposity (BRI) as the predominant mediating factor. These exploratory findings, based on questionnaire-defined OSA, warrant prospective validation in cohorts with polysomnography.
Jaishankar, D.; Gjorgjieva, T.; Jala, J.; Swigert, J.; Young, A. S.; Benjamin, D. J.; Cesarini, D. A.; Turley, P.
We introduce a novel approach, Genomic-Relatedness-Matched Association (GRMA) studies, as an alternative to genome-wide association studies (GWAS). GWAS are typically restricted to samples of mostly unrelated individuals with a single, shared continental ancestry and nevertheless can still be biased by gene-environment correlation and assortative mating. In contrast, GRMA can be implemented in ancestrally diverse samples--retaining individuals of mixed or underrepresented ancestries and eliminating the need to assign labels to ancestry groups--and can reduce bias relative to standard GWAS. GRMA matches each individual to a group of controls whose pairwise relatedness with the individual exceeds a user-specified threshold. It generates SNP-level summary statistics based on within-group associations. In applications using the UK Biobank and All of Us data, we find that GRMA compares favorably to GWAS methods in terms of bias, precision, and population coverage. GRMA enables several novel findings; for example, we find that "genetic nurture" is unlikely to be an important source of genome-wide bias in population GWAS of body mass index, height, and educational attainment. The method is computationally efficient and supported by open-source software, facilitating its application in large-scale scientific and health-related studies.
Zevounou, J.; Lo, K. S.; McGinnis, C. S.; Satpathy, A. T.; Lettre, G.
Genome-wide association studies (GWAS) have identified thousands of non-coding variants associated with complex traits and diseases. However, it remains challenging to pinpoint the causal genes that are regulated by associated genetic variants. Connecting causal non-coding variants with genes can rely on methods that identify direct physical interactions (e.g. chromosome conformation capture) or on probabilistic models that predict regulatory links. These statistical models take advantage of gene expression and chromatin accessibility profiles generated in cells and tissues by bulk or single-cell (sc) methodologies. Here, we tested whether using bulk or sc RNAseq/ATACseq data and corresponding predictive enhancer-to-gene models impact the prioritization of causal GWAS genes. Using non-treated and TNF-treated human endothelial cells in vitro as a well-controlled experimental system, we show that bulk and sc RNAseq/ATACseq profiles are similar and highlight the same biology (e.g. biological pathways). Despite these similarities, we show using GWAS results for coronary artery disease (CAD) and diastolic blood pressure that applying enhancer-to-gene models designed for bulk or sc methodologies can yield differences in terms of captured heritability, fine-mapped variants and linked genes. For instance, at one CAD locus, the bulk-based ABC model predicts a regulatory link with BCAR1, whereas the sc-based model scE2G prioritizes a different gene (CFDP1). On the same experimental model, our results indicate that choosing between a bulk or sc approach will influence regulatory link model predictions; this should be considered when planning functional experiments to characterize GWAS discoveries.